library("readxl")
library(dplyr)
library(tidyr)
library(stats)
library(ggplot2)
library(car)
library(knitr)
library(rstatix)
library(evaluator)
library(multcomp)
library(vegan)
library(plotly)
library(ggthemes)
theme_set(theme_bw())
Briefly about the article: in this work, the researchers investigated whether memantine can improve learning in a Down syndrome mouse model (Ts65Dn, which models human trisomy 21). They described the effects of memantine on protein expression in brain regions of Ts65Dn mice exposed to context fear conditioning (CFC).
For this purpose, the following experiment was set up: mice were placed in a novel cage, allowed to explore for three minutes and then given an electric shock. These mice form the context-shock (CS) group and learn to associate the context with the aversive stimulus. Learning is displayed by “freezing” upon re-exposure to the context, where freezing is defined as a lack of movement. A second group of mice was placed in the novel cage, immediately given the electric shock, and then allowed to explore for three minutes. These mice form the shock-context (SC) group and do not acquire conditioned fear. All mice received an injection of memantine or an equivalent volume of saline 15 minutes prior to exposure to the novel context. Eight groups of mice (7-10 per group) were used: trisomic mice and controls, each split into CS mice injected with saline or memantine and SC mice injected with saline or memantine, as described below.
Experiment scheme
Classes:
c-CS-s: control mice, stimulated to learn, injected with saline (9 mice)
c-CS-m: control mice, stimulated to learn, injected with memantine (10 mice)
c-SC-s: control mice, not stimulated to learn, injected with saline (9 mice)
c-SC-m: control mice, not stimulated to learn, injected with memantine (10 mice)
t-CS-s: trisomy mice, stimulated to learn, injected with saline (7 mice)
t-CS-m: trisomy mice, stimulated to learn, injected with memantine (9 mice)
t-SC-s: trisomy mice, not stimulated to learn, injected with saline (9 mice)
t-SC-m: trisomy mice, not stimulated to learn, injected with memantine (9 mice)
First, let’s read the data and see how many observations we have
mouse_data <- read_excel('Data_Cortex_Nuclear.xls')
count(mouse_data)
## # A tibble: 1 x 1
## n
## <int>
## 1 1080
We see that there are many more observations than mice in the experiment (1080 vs 72). This means that repeated measurements were made for each mouse. Which mouse a repetition belongs to can be an important factor in further analysis, so we need to split the MouseID column into two: the ID itself and the measurement number.
mouses <- mouse_data %>% separate(col = MouseID, into = c("MouseID", "measurement"), sep = "_")
Let’s also examine the data structure.
str(mouses)
## tibble [1,080 × 83] (S3: tbl_df/tbl/data.frame)
## $ MouseID : chr [1:1080] "309" "309" "309" "309" ...
## $ measurement : chr [1:1080] "1" "2" "3" "4" ...
## $ DYRK1A_N : num [1:1080] 0.504 0.515 0.509 0.442 0.435 ...
## $ ITSN1_N : num [1:1080] 0.747 0.689 0.73 0.617 0.617 ...
## $ BDNF_N : num [1:1080] 0.43 0.412 0.418 0.359 0.359 ...
## $ NR1_N : num [1:1080] 2.82 2.79 2.69 2.47 2.37 ...
## $ NR2A_N : num [1:1080] 5.99 5.69 5.62 4.98 4.72 ...
## $ pAKT_N : num [1:1080] 0.219 0.212 0.209 0.223 0.213 ...
## $ pBRAF_N : num [1:1080] 0.178 0.173 0.176 0.176 0.174 ...
## $ pCAMKII_N : num [1:1080] 2.37 2.29 2.28 2.15 2.13 ...
## $ pCREB_N : num [1:1080] 0.232 0.227 0.23 0.207 0.192 ...
## $ pELK_N : num [1:1080] 1.75 1.6 1.56 1.6 1.5 ...
## $ pERK_N : num [1:1080] 0.688 0.695 0.677 0.583 0.551 ...
## $ pJNK_N : num [1:1080] 0.306 0.299 0.291 0.297 0.287 ...
## $ PKCA_N : num [1:1080] 0.403 0.386 0.381 0.377 0.364 ...
## $ pMEK_N : num [1:1080] 0.297 0.281 0.282 0.314 0.278 ...
## $ pNR1_N : num [1:1080] 1.022 0.957 1.004 0.875 0.865 ...
## $ pNR2A_N : num [1:1080] 0.606 0.588 0.602 0.52 0.508 ...
## $ pNR2B_N : num [1:1080] 1.88 1.73 1.73 1.57 1.48 ...
## $ pPKCAB_N : num [1:1080] 2.31 2.04 2.02 2.13 2.01 ...
## $ pRSK_N : num [1:1080] 0.442 0.445 0.468 0.478 0.483 ...
## $ AKT_N : num [1:1080] 0.859 0.835 0.814 0.728 0.688 ...
## $ BRAF_N : num [1:1080] 0.416 0.4 0.4 0.386 0.368 ...
## $ CAMKII_N : num [1:1080] 0.37 0.356 0.368 0.363 0.355 ...
## $ CREB_N : num [1:1080] 0.179 0.174 0.174 0.179 0.175 ...
## $ ELK_N : num [1:1080] 1.87 1.76 1.77 1.29 1.32 ...
## $ ERK_N : num [1:1080] 3.69 3.49 3.57 2.97 2.9 ...
## $ GSK3B_N : num [1:1080] 1.54 1.51 1.5 1.42 1.36 ...
## $ JNK_N : num [1:1080] 0.265 0.256 0.26 0.26 0.251 ...
## $ MEK_N : num [1:1080] 0.32 0.304 0.312 0.279 0.274 ...
## $ TRKA_N : num [1:1080] 0.814 0.781 0.785 0.734 0.703 ...
## $ RSK_N : num [1:1080] 0.166 0.157 0.161 0.162 0.155 ...
## $ APP_N : num [1:1080] 0.454 0.431 0.423 0.411 0.399 ...
## $ Bcatenin_N : num [1:1080] 3.04 2.92 2.94 2.5 2.46 ...
## $ SOD1_N : num [1:1080] 0.37 0.342 0.344 0.345 0.329 ...
## $ MTOR_N : num [1:1080] 0.459 0.424 0.425 0.429 0.409 ...
## $ P38_N : num [1:1080] 0.335 0.325 0.325 0.33 0.313 ...
## $ pMTOR_N : num [1:1080] 0.825 0.762 0.757 0.747 0.692 ...
## $ DSCR1_N : num [1:1080] 0.577 0.545 0.544 0.547 0.537 ...
## $ AMPKA_N : num [1:1080] 0.448 0.421 0.405 0.387 0.361 ...
## $ NR2B_N : num [1:1080] 0.586 0.545 0.553 0.548 0.513 ...
## $ pNUMB_N : num [1:1080] 0.395 0.368 0.364 0.367 0.352 ...
## $ RAPTOR_N : num [1:1080] 0.34 0.322 0.313 0.328 0.312 ...
## $ TIAM1_N : num [1:1080] 0.483 0.455 0.447 0.443 0.419 ...
## $ pP70S6_N : num [1:1080] 0.294 0.276 0.257 0.399 0.393 ...
## $ NUMB_N : num [1:1080] 0.182 0.182 0.184 0.162 0.16 ...
## $ P70S6_N : num [1:1080] 0.843 0.848 0.856 0.76 0.768 ...
## $ pGSK3B_N : num [1:1080] 0.193 0.195 0.201 0.184 0.186 ...
## $ pPKCG_N : num [1:1080] 1.44 1.44 1.52 1.61 1.65 ...
## $ CDK5_N : num [1:1080] 0.295 0.294 0.302 0.296 0.297 ...
## $ S6_N : num [1:1080] 0.355 0.355 0.386 0.291 0.309 ...
## $ ADARB1_N : num [1:1080] 1.34 1.31 1.28 1.2 1.21 ...
## $ AcetylH3K9_N : num [1:1080] 0.17 0.171 0.185 0.16 0.165 ...
## $ RRP1_N : num [1:1080] 0.159 0.158 0.149 0.166 0.161 ...
## $ BAX_N : num [1:1080] 0.189 0.185 0.191 0.185 0.188 ...
## $ ARC_N : num [1:1080] 0.106 0.107 0.108 0.103 0.105 ...
## $ ERBB4_N : num [1:1080] 0.145 0.15 0.145 0.141 0.142 ...
## $ nNOS_N : num [1:1080] 0.177 0.178 0.176 0.164 0.168 ...
## $ Tau_N : num [1:1080] 0.125 0.134 0.133 0.123 0.137 ...
## $ GFAP_N : num [1:1080] 0.115 0.118 0.118 0.117 0.116 ...
## $ GluR3_N : num [1:1080] 0.228 0.238 0.245 0.235 0.256 ...
## $ GluR4_N : num [1:1080] 0.143 0.142 0.142 0.145 0.141 ...
## $ IL1B_N : num [1:1080] 0.431 0.457 0.51 0.431 0.481 ...
## $ P3525_N : num [1:1080] 0.248 0.258 0.255 0.251 0.252 ...
## $ pCASP9_N : num [1:1080] 1.6 1.67 1.66 1.48 1.53 ...
## $ PSD95_N : num [1:1080] 2.01 2 2.02 1.96 2.01 ...
## $ SNCA_N : num [1:1080] 0.108 0.11 0.108 0.12 0.12 ...
## $ Ubiquitin_N : num [1:1080] 1.045 1.01 0.997 0.99 0.998 ...
## $ pGSK3B_Tyr216_N: num [1:1080] 0.832 0.849 0.847 0.833 0.879 ...
## $ SHH_N : num [1:1080] 0.189 0.2 0.194 0.192 0.206 ...
## $ BAD_N : num [1:1080] 0.123 0.117 0.119 0.133 0.13 ...
## $ BCL2_N : num [1:1080] NA NA NA NA NA NA NA NA NA NA ...
## $ pS6_N : num [1:1080] 0.106 0.107 0.108 0.103 0.105 ...
## $ pCFOS_N : num [1:1080] 0.108 0.104 0.106 0.111 0.111 ...
## $ SYP_N : num [1:1080] 0.427 0.442 0.436 0.392 0.434 ...
## $ H3AcK18_N : num [1:1080] 0.115 0.112 0.112 0.13 0.118 ...
## $ EGR1_N : num [1:1080] 0.132 0.135 0.133 0.147 0.14 ...
## $ H3MeK4_N : num [1:1080] 0.128 0.131 0.127 0.147 0.148 ...
## $ CaNA_N : num [1:1080] 1.68 1.74 1.93 1.7 1.84 ...
## $ Genotype : chr [1:1080] "Control" "Control" "Control" "Control" ...
## $ Treatment : chr [1:1080] "Memantine" "Memantine" "Memantine" "Memantine" ...
## $ Behavior : chr [1:1080] "C/S" "C/S" "C/S" "C/S" ...
## $ class : chr [1:1080] "c-CS-m" "c-CS-m" "c-CS-m" "c-CS-m" ...
We see that columns such as ID, genotype, treatment, behavior and class are stored as character, but it is better to convert them to factors.
mouses$class <- as.factor(mouses$class)
mouses$Behavior <- as.factor(mouses$Behavior)
mouses$Treatment <- as.factor(mouses$Treatment)
mouses$Genotype <- as.factor(mouses$Genotype)
mouses$MouseID <- as.factor(mouses$MouseID)
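The five conversions above can also be done in a single step. A base-R sketch on a toy data frame (`demo` and its values are illustrative; on the real data, `demo` would be `mouses`):

```r
demo <- data.frame(MouseID = c("309", "311"), Genotype = "Control",
                   Treatment = "Saline", Behavior = "C/S", class = "c-CS-s",
                   stringsAsFactors = FALSE)
factor_cols <- c("MouseID", "Genotype", "Treatment", "Behavior", "class")
# Convert all listed columns to factors in one pass
demo[factor_cols] <- lapply(demo[factor_cols], as.factor)
sapply(demo, is.factor)  # all TRUE
```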
Let’s also take a look at the representation of the classes. We can see that the groups are reasonably balanced.
class_count <- mouses %>% group_by(class) %>% count()
class_count
## # A tibble: 8 x 2
## # Groups: class [8]
## class n
## <fct> <int>
## 1 c-CS-m 150
## 2 c-CS-s 135
## 3 c-SC-m 150
## 4 c-SC-s 135
## 5 t-CS-m 135
## 6 t-CS-s 105
## 7 t-SC-m 135
## 8 t-SC-s 135
Let’s also see how many missing values there are in the data and for which variables. We will also find out how many complete observations there are in the dataset.
mouse_na <- colSums(is.na(mouses))
sum(is.na(mouses))
## [1] 1396
as.table(mouse_na)
## MouseID measurement DYRK1A_N ITSN1_N BDNF_N
## 0 0 3 3 3
## NR1_N NR2A_N pAKT_N pBRAF_N pCAMKII_N
## 3 3 3 3 3
## pCREB_N pELK_N pERK_N pJNK_N PKCA_N
## 3 3 3 3 3
## pMEK_N pNR1_N pNR2A_N pNR2B_N pPKCAB_N
## 3 3 3 3 3
## pRSK_N AKT_N BRAF_N CAMKII_N CREB_N
## 3 3 3 3 3
## ELK_N ERK_N GSK3B_N JNK_N MEK_N
## 18 3 3 3 7
## TRKA_N RSK_N APP_N Bcatenin_N SOD1_N
## 3 3 3 18 3
## MTOR_N P38_N pMTOR_N DSCR1_N AMPKA_N
## 3 3 3 3 3
## NR2B_N pNUMB_N RAPTOR_N TIAM1_N pP70S6_N
## 3 3 3 3 3
## NUMB_N P70S6_N pGSK3B_N pPKCG_N CDK5_N
## 0 0 0 0 0
## S6_N ADARB1_N AcetylH3K9_N RRP1_N BAX_N
## 0 0 0 0 0
## ARC_N ERBB4_N nNOS_N Tau_N GFAP_N
## 0 0 0 0 0
## GluR3_N GluR4_N IL1B_N P3525_N pCASP9_N
## 0 0 0 0 0
## PSD95_N SNCA_N Ubiquitin_N pGSK3B_Tyr216_N SHH_N
## 0 0 0 0 0
## BAD_N BCL2_N pS6_N pCFOS_N SYP_N
## 213 285 0 75 0
## H3AcK18_N EGR1_N H3MeK4_N CaNA_N Genotype
## 180 210 270 0 0
## Treatment Behavior class
## 0 0 0
mouses_without_na <- na.omit(mouses)
count(mouses_without_na)
## # A tibble: 1 x 1
## n
## <int>
## 1 552
There are 8 classes of mice in this dataset; 15 technical replicates were carried out for each of the 72 mice. For comparing a large number of classes, ANOVA is appropriate. I decided to account for both the class and the individual mouse, since individual characteristics play an important role in the experiment. I did not use each of the class factors (genotype, treatment, behavior) separately, otherwise the analysis would become very cumbersome. Unfortunately, I was unable to run a nested test with the model lm(BDNF_N ~ class + MouseID, data = mouses). However, I think it is incorrect to carry out the analysis without taking into account that the data contain repeated measurements. Below I give the ANOVA results both ignoring the replicates and accounting for them (for the latter I took the mean value for each mouse).
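A nested test of this kind can be expressed in base R with aov() and an Error() stratum: replicates nest within MouseID, so class gets tested against the between-mouse variance rather than the replicate-level variance. A sketch on simulated data (the group means and sample sizes here are invented for illustration; on the real data the call would be `aov(BDNF_N ~ class + Error(MouseID), data = mouses)`):

```r
set.seed(1)
# Two classes, three mice per class, 15 replicates per mouse
demo <- data.frame(
  class   = factor(rep(c("c-CS-s", "c-SC-s"), each = 45)),
  MouseID = factor(rep(paste0("m", 1:6), each = 15)),
  BDNF_N  = rnorm(90, mean = rep(c(0.32, 0.29), each = 45), sd = 0.02)
)
nested <- aov(BDNF_N ~ class + Error(MouseID), data = demo)
summary(nested)  # class appears in the MouseID error stratum
```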
First, let’s group the data by the class and MouseID variables and look at the summary statistics for the variable BDNF_N. I will also use the obtained statistics for an ANOVA based on the averaged BDNF_N expression values.
BDNF_mean <- mouses %>%
group_by(class, MouseID) %>%
get_summary_stats(BDNF_N, type = "mean_sd")
BDNF_mean
## # A tibble: 72 x 6
## MouseID class variable n mean sd
## <fct> <fct> <chr> <dbl> <dbl> <dbl>
## 1 309 c-CS-m BDNF_N 15 0.356 0.038
## 2 311 c-CS-m BDNF_N 15 0.335 0.033
## 3 320 c-CS-m BDNF_N 15 0.399 0.037
## 4 321 c-CS-m BDNF_N 15 0.359 0.033
## 5 322 c-CS-m BDNF_N 15 0.315 0.013
## 6 3415 c-CS-m BDNF_N 15 0.315 0.055
## 7 3499 c-CS-m BDNF_N 15 0.288 0.044
## 8 3507 c-CS-m BDNF_N 15 0.317 0.017
## 9 3520 c-CS-m BDNF_N 15 0.341 0.033
## 10 3521 c-CS-m BDNF_N 15 0.366 0.045
## # … with 62 more rows
Let’s remove the missing values of BDNF_N (we can do this since there are only 3 missing observations) and plot BDNF_N by class (means with confidence intervals). We will not display individual mice on the plot, so as not to clutter it.
BDNF_N_without_NA <- mouses %>% filter( !is.na(BDNF_N))
ggplot(BDNF_N_without_NA, aes(class, BDNF_N, color = class)) + stat_summary(fun.data = "mean_cl_normal") + ggtitle(label = "BDNF expression versus class plot")
Let’s build linear models of BDNF_N expression as a function of class, one on the raw observations and one on the per-mouse means, and use them to check the applicability conditions (ANOVA has the same applicability conditions as linear models).
fit <- lm(BDNF_N ~ class , BDNF_N_without_NA)
fit2 <- lm(mean ~ class, BDNF_mean)
We see that when only the mouse class is taken into account, the residual plots look good and the distribution looks normal. When averaged values are used, the residual plots raise doubts: the medians are shifted and the variance differs considerably between some groups.
simple_diag <- fortify(fit)
simple_diag2 <- fortify(fit2)
ggplot(simple_diag, aes(x = 1:nrow(simple_diag), y = .cooksd)) +
geom_bar(stat = 'identity')
ggplot(simple_diag2, aes(x = 1:nrow(simple_diag2), y = .cooksd)) +
geom_bar(stat = 'identity')
ggplot(data = simple_diag, aes(x = class, y = .stdresid, colour = class)) +
geom_boxplot() + geom_hline(yintercept = 0)
ggplot(data = simple_diag2, aes(x = class, y = .stdresid, colour = class)) +
geom_boxplot() + geom_hline(yintercept = 0)
qqPlot(fit, id = FALSE)
qqPlot(fit2, id = FALSE)
In both cases, the ANOVA results indicate a difference in protein expression depending on the mouse class. However, when averaged values are used, the p-value is close to 0.05, which suggests the difference is marginal. For more reliable conclusions, we need to look at the post-hoc tests.
res.aov <- anova_test(
data = BDNF_N_without_NA, dv = BDNF_N, wid = MouseID,
between = class
)
res.aov
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 class 7 1069 18.816 8.86e-24 * 0.11
res.aov2 <- anova_test(
data = BDNF_mean, dv = mean, wid = MouseID,
between = class
)
res.aov2
## ANOVA Table (type II tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 class 7 64 2.21 0.045 * 0.195
The post-hoc tests for the model built without accounting for the replicates show far more class pairs differing in protein expression than the model built on averaged values. That is, by ignoring the individual characteristics of the mice we would find many more apparent differences in BDNF expression. In fact, we can speak of a significant difference in BDNF expression only between the groups c-SC-m and c-CS-s. The pair c-SC-m and c-CS-m is also close to significance (p-value = 0.06). These results agree well with the “BDNF expression versus class plot”. It can be assumed that these differences are due to the behavioral context of the experiment rather than to the use of memantine.
fit_inter <- lm(BDNF_N ~ class , data = BDNF_N_without_NA)
dat_tukey <- glht(fit_inter, linfct = mcp(class = 'Tukey'))
summary(dat_tukey)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = BDNF_N ~ class, data = BDNF_N_without_NA)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## c-CS-s - c-CS-m == 0 0.0030979 0.0055459 0.559 0.9993
## c-SC-m - c-CS-m == 0 -0.0482717 0.0053980 -8.942 <0.01 ***
## c-SC-s - c-CS-m == 0 -0.0258249 0.0055459 -4.657 <0.01 ***
## t-CS-m - c-CS-m == 0 -0.0264852 0.0055459 -4.776 <0.01 ***
## t-CS-s - c-CS-m == 0 -0.0337570 0.0059483 -5.675 <0.01 ***
## t-SC-m - c-CS-m == 0 -0.0181541 0.0055459 -3.273 0.0242 *
## t-SC-s - c-CS-m == 0 -0.0136310 0.0055790 -2.443 0.2213
## c-SC-m - c-CS-s == 0 -0.0513696 0.0055459 -9.263 <0.01 ***
## c-SC-s - c-CS-s == 0 -0.0289228 0.0056900 -5.083 <0.01 ***
## t-CS-m - c-CS-s == 0 -0.0295831 0.0056900 -5.199 <0.01 ***
## t-CS-s - c-CS-s == 0 -0.0368549 0.0060829 -6.059 <0.01 ***
## t-SC-m - c-CS-s == 0 -0.0212520 0.0056900 -3.735 <0.01 **
## t-SC-s - c-CS-s == 0 -0.0167289 0.0057223 -2.923 0.0689 .
## c-SC-s - c-SC-m == 0 0.0224468 0.0055459 4.047 <0.01 **
## t-CS-m - c-SC-m == 0 0.0217865 0.0055459 3.928 <0.01 **
## t-CS-s - c-SC-m == 0 0.0145147 0.0059483 2.440 0.2227
## t-SC-m - c-SC-m == 0 0.0301176 0.0055459 5.431 <0.01 ***
## t-SC-s - c-SC-m == 0 0.0346406 0.0055790 6.209 <0.01 ***
## t-CS-m - c-SC-s == 0 -0.0006603 0.0056900 -0.116 1.0000
## t-CS-s - c-SC-s == 0 -0.0079321 0.0060829 -1.304 0.8972
## t-SC-m - c-SC-s == 0 0.0076708 0.0056900 1.348 0.8797
## t-SC-s - c-SC-s == 0 0.0121939 0.0057223 2.131 0.3951
## t-CS-s - t-CS-m == 0 -0.0072718 0.0060829 -1.195 0.9332
## t-SC-m - t-CS-m == 0 0.0083311 0.0056900 1.464 0.8260
## t-SC-s - t-CS-m == 0 0.0128542 0.0057223 2.246 0.3244
## t-SC-m - t-CS-s == 0 0.0156029 0.0060829 2.565 0.1697
## t-SC-s - t-CS-s == 0 0.0201260 0.0061130 3.292 0.0228 *
## t-SC-s - t-SC-m == 0 0.0045231 0.0057223 0.790 0.9936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
fit_inter2 <- lm(mean ~ class , data = BDNF_mean)
dat_tukey2 <- glht(fit_inter2, linfct = mcp(class = 'Tukey'))
summary(dat_tukey2)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = mean ~ class, data = BDNF_mean)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## c-CS-s - c-CS-m == 0 0.0032333 0.0162035 0.200 1.0000
## c-SC-m - c-CS-m == 0 -0.0482000 0.0157713 -3.056 0.0611 .
## c-SC-s - c-CS-m == 0 -0.0257667 0.0162035 -1.590 0.7538
## t-CS-m - c-CS-m == 0 -0.0264333 0.0162035 -1.631 0.7294
## t-CS-s - c-CS-m == 0 -0.0335286 0.0173792 -1.929 0.5358
## t-SC-m - c-CS-m == 0 -0.0179889 0.0162035 -1.110 0.9522
## t-SC-s - c-CS-m == 0 -0.0129889 0.0162035 -0.802 0.9924
## c-SC-m - c-CS-s == 0 -0.0514333 0.0162035 -3.174 0.0446 *
## c-SC-s - c-CS-s == 0 -0.0290000 0.0166245 -1.744 0.6579
## t-CS-m - c-CS-s == 0 -0.0296667 0.0166245 -1.785 0.6320
## t-CS-s - c-CS-s == 0 -0.0367619 0.0177723 -2.068 0.4451
## t-SC-m - c-CS-s == 0 -0.0212222 0.0166245 -1.277 0.9039
## t-SC-s - c-CS-s == 0 -0.0162222 0.0166245 -0.976 0.9762
## c-SC-s - c-SC-m == 0 0.0224333 0.0162035 1.384 0.8606
## t-CS-m - c-SC-m == 0 0.0217667 0.0162035 1.343 0.8782
## t-CS-s - c-SC-m == 0 0.0146714 0.0173792 0.844 0.9897
## t-SC-m - c-SC-m == 0 0.0302111 0.0162035 1.864 0.5791
## t-SC-s - c-SC-m == 0 0.0352111 0.0162035 2.173 0.3809
## t-CS-m - c-SC-s == 0 -0.0006667 0.0166245 -0.040 1.0000
## t-CS-s - c-SC-s == 0 -0.0077619 0.0177723 -0.437 0.9998
## t-SC-m - c-SC-s == 0 0.0077778 0.0166245 0.468 0.9998
## t-SC-s - c-SC-s == 0 0.0127778 0.0166245 0.769 0.9941
## t-CS-s - t-CS-m == 0 -0.0070952 0.0177723 -0.399 0.9999
## t-SC-m - t-CS-m == 0 0.0084444 0.0166245 0.508 0.9996
## t-SC-s - t-CS-m == 0 0.0134444 0.0166245 0.809 0.9920
## t-SC-m - t-CS-s == 0 0.0155397 0.0177723 0.874 0.9873
## t-SC-s - t-CS-s == 0 0.0205397 0.0177723 1.156 0.9412
## t-SC-s - t-SC-m == 0 0.0050000 0.0166245 0.301 1.0000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
Let’s build a linear model of the dependence of ERBB4_N expression on the other proteins. We can see that one protein, PSD95_N, strongly influences the expression of the protein of interest. Three more proteins with good p-values also stand out: BAX_N, Tau_N and pPKCG_N.
lin_mod <- lm(ERBB4_N ~ . - MouseID - measurement - Genotype - Treatment -Behavior - class , mouses)
summary(lin_mod)
##
## Call:
## lm(formula = ERBB4_N ~ . - MouseID - measurement - Genotype -
## Treatment - Behavior - class, data = mouses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.023037 -0.003370 -0.000127 0.003071 0.031955
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.566e-02 8.245e-03 3.112 0.001970 **
## DYRK1A_N -6.824e-03 7.475e-03 -0.913 0.361710
## ITSN1_N 8.127e-03 1.058e-02 0.768 0.442776
## BDNF_N 3.833e-02 1.919e-02 1.997 0.046378 *
## NR1_N -1.409e-02 6.685e-03 -2.108 0.035544 *
## NR2A_N -1.049e-04 1.361e-03 -0.077 0.938610
## pAKT_N 7.961e-02 3.044e-02 2.615 0.009203 **
## pBRAF_N -4.103e-02 3.065e-02 -1.339 0.181356
## pCAMKII_N -1.451e-05 7.954e-04 -0.018 0.985451
## pCREB_N -6.015e-02 2.910e-02 -2.067 0.039290 *
## pELK_N 4.731e-05 1.030e-03 0.046 0.963391
## pERK_N -1.987e-02 7.887e-03 -2.519 0.012109 *
## pJNK_N -6.035e-02 2.538e-02 -2.378 0.017814 *
## PKCA_N 8.377e-03 2.736e-02 0.306 0.759625
## pMEK_N 8.498e-03 2.706e-02 0.314 0.753676
## pNR1_N -2.566e-02 1.334e-02 -1.924 0.054968 .
## pNR2A_N 1.004e-02 7.239e-03 1.388 0.165927
## pNR2B_N 1.928e-02 6.659e-03 2.896 0.003959 **
## pPKCAB_N 5.727e-03 2.773e-03 2.065 0.039457 *
## pRSK_N 1.001e-02 1.427e-02 0.701 0.483573
## AKT_N -6.738e-04 8.183e-03 -0.082 0.934413
## BRAF_N 1.550e-02 1.213e-02 1.278 0.201926
## CAMKII_N -3.429e-03 2.011e-02 -0.170 0.864700
## CREB_N -1.450e-02 3.091e-02 -0.469 0.639135
## ELK_N 3.510e-03 3.719e-03 0.944 0.345811
## ERK_N 4.660e-03 2.804e-03 1.662 0.097208 .
## GSK3B_N -5.840e-03 6.824e-03 -0.856 0.392588
## JNK_N -1.188e-02 3.378e-02 -0.352 0.725317
## MEK_N 8.785e-03 2.197e-02 0.400 0.689510
## TRKA_N 5.722e-03 1.646e-02 0.348 0.728257
## RSK_N -2.299e-02 3.915e-02 -0.587 0.557214
## APP_N -5.345e-03 1.825e-02 -0.293 0.769751
## Bcatenin_N 1.107e-03 5.298e-03 0.209 0.834546
## SOD1_N -6.241e-03 3.985e-03 -1.566 0.118018
## MTOR_N 4.161e-02 1.651e-02 2.520 0.012048 *
## P38_N -1.619e-02 1.348e-02 -1.201 0.230267
## pMTOR_N -9.530e-03 7.667e-03 -1.243 0.214475
## DSCR1_N -6.849e-03 1.023e-02 -0.670 0.503374
## AMPKA_N 2.522e-02 2.294e-02 1.099 0.272263
## NR2B_N 7.391e-03 9.146e-03 0.808 0.419449
## pNUMB_N -6.467e-03 2.006e-02 -0.322 0.747326
## RAPTOR_N 4.750e-02 2.465e-02 1.927 0.054629 .
## TIAM1_N -3.284e-02 1.956e-02 -1.679 0.093892 .
## pP70S6_N 2.739e-03 5.307e-03 0.516 0.606022
## NUMB_N -3.304e-02 3.784e-02 -0.873 0.383042
## P70S6_N -4.357e-03 5.119e-03 -0.851 0.395118
## pGSK3B_N 1.291e-01 4.063e-02 3.179 0.001577 **
## pPKCG_N -7.636e-03 1.962e-03 -3.893 0.000113 ***
## CDK5_N 7.111e-04 9.807e-03 0.073 0.942225
## S6_N 1.566e-02 7.647e-03 2.049 0.041059 *
## ADARB1_N -2.609e-03 2.040e-03 -1.279 0.201557
## AcetylH3K9_N 2.113e-03 1.103e-02 0.192 0.848199
## RRP1_N -1.488e-02 1.038e-02 -1.434 0.152232
## BAX_N -1.304e-01 3.542e-02 -3.680 0.000260 ***
## ARC_N 1.687e-01 6.748e-02 2.499 0.012775 *
## nNOS_N -4.749e-03 2.854e-02 -0.166 0.867918
## Tau_N 7.082e-02 1.810e-02 3.913 0.000104 ***
## GFAP_N -4.703e-02 5.400e-02 -0.871 0.384291
## GluR3_N -7.133e-03 2.436e-02 -0.293 0.769812
## GluR4_N -7.030e-02 3.373e-02 -2.084 0.037672 *
## IL1B_N 2.845e-02 1.081e-02 2.631 0.008794 **
## P3525_N 4.994e-02 2.423e-02 2.061 0.039810 *
## pCASP9_N 8.118e-03 3.044e-03 2.667 0.007923 **
## PSD95_N 2.379e-02 3.811e-03 6.242 9.55e-10 ***
## SNCA_N -1.137e-02 3.484e-02 -0.326 0.744195
## Ubiquitin_N 2.330e-03 4.883e-03 0.477 0.633468
## pGSK3B_Tyr216_N 2.808e-02 1.006e-02 2.793 0.005439 **
## SHH_N 4.904e-02 1.998e-02 2.454 0.014471 *
## BAD_N -3.592e-02 3.508e-02 -1.024 0.306311
## BCL2_N 6.173e-03 3.112e-02 0.198 0.842840
## pS6_N NA NA NA NA
## pCFOS_N -1.232e-02 3.111e-02 -0.396 0.692240
## SYP_N 1.354e-02 1.079e-02 1.254 0.210395
## H3AcK18_N 2.401e-03 2.527e-02 0.095 0.924349
## EGR1_N -6.016e-03 2.619e-02 -0.230 0.818406
## H3MeK4_N 1.227e-02 2.154e-02 0.570 0.569097
## CaNA_N -1.749e-03 3.419e-03 -0.512 0.609159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.005885 on 476 degrees of freedom
## (528 observations deleted due to missingness)
## Multiple R-squared: 0.8641, Adjusted R-squared: 0.8427
## F-statistic: 40.37 on 75 and 476 DF, p-value: < 2.2e-16
In theory, based on this model, we could talk about the influence of some proteins on ERBB4_N expression. But first the model must be diagnosed to check whether it is valid to use it.
I strongly suspect that we will observe multicollinearity in this model, since proteins do not act in isolation but are linked to each other in molecular cascades.
At first I ran into a difficulty when checking for multicollinearity: there are aliased coefficients in the model. This happens when predictors are perfectly collinear. The offending predictor can be identified with alias().
alias( lm( ERBB4_N ~ . - MouseID - measurement - Genotype - Treatment -Behavior - class , mouses ) )
## Model :
## ERBB4_N ~ (MouseID + measurement + DYRK1A_N + ITSN1_N + BDNF_N +
## NR1_N + NR2A_N + pAKT_N + pBRAF_N + pCAMKII_N + pCREB_N +
## pELK_N + pERK_N + pJNK_N + PKCA_N + pMEK_N + pNR1_N + pNR2A_N +
## pNR2B_N + pPKCAB_N + pRSK_N + AKT_N + BRAF_N + CAMKII_N +
## CREB_N + ELK_N + ERK_N + GSK3B_N + JNK_N + MEK_N + TRKA_N +
## RSK_N + APP_N + Bcatenin_N + SOD1_N + MTOR_N + P38_N + pMTOR_N +
## DSCR1_N + AMPKA_N + NR2B_N + pNUMB_N + RAPTOR_N + TIAM1_N +
## pP70S6_N + NUMB_N + P70S6_N + pGSK3B_N + pPKCG_N + CDK5_N +
## S6_N + ADARB1_N + AcetylH3K9_N + RRP1_N + BAX_N + ARC_N +
## nNOS_N + Tau_N + GFAP_N + GluR3_N + GluR4_N + IL1B_N + P3525_N +
## pCASP9_N + PSD95_N + SNCA_N + Ubiquitin_N + pGSK3B_Tyr216_N +
## SHH_N + BAD_N + BCL2_N + pS6_N + pCFOS_N + SYP_N + H3AcK18_N +
## EGR1_N + H3MeK4_N + CaNA_N + Genotype + Treatment + Behavior +
## class) - MouseID - measurement - Genotype - Treatment - Behavior -
## class
##
## Complete :
## (Intercept) DYRK1A_N ITSN1_N BDNF_N NR1_N NR2A_N pAKT_N pBRAF_N pCAMKII_N
## pS6_N 0 0 0 0 0 0 0 0 0
## pCREB_N pELK_N pERK_N pJNK_N PKCA_N pMEK_N pNR1_N pNR2A_N pNR2B_N
## pS6_N 0 0 0 0 0 0 0 0 0
## pPKCAB_N pRSK_N AKT_N BRAF_N CAMKII_N CREB_N ELK_N ERK_N GSK3B_N JNK_N
## pS6_N 0 0 0 0 0 0 0 0 0 0
## MEK_N TRKA_N RSK_N APP_N Bcatenin_N SOD1_N MTOR_N P38_N pMTOR_N DSCR1_N
## pS6_N 0 0 0 0 0 0 0 0 0 0
## AMPKA_N NR2B_N pNUMB_N RAPTOR_N TIAM1_N pP70S6_N NUMB_N P70S6_N pGSK3B_N
## pS6_N 0 0 0 0 0 0 0 0 0
## pPKCG_N CDK5_N S6_N ADARB1_N AcetylH3K9_N RRP1_N BAX_N ARC_N nNOS_N Tau_N
## pS6_N 0 0 0 0 0 0 0 1 0 0
## GFAP_N GluR3_N GluR4_N IL1B_N P3525_N pCASP9_N PSD95_N SNCA_N Ubiquitin_N
## pS6_N 0 0 0 0 0 0 0 0 0
## pGSK3B_Tyr216_N SHH_N BAD_N BCL2_N pCFOS_N SYP_N H3AcK18_N EGR1_N
## pS6_N 0 0 0 0 0 0 0 0
## H3MeK4_N CaNA_N
## pS6_N 0 0
lin_mod <- update(lin_mod, .~. - pS6_N)
Now we can continue the diagnostics.
head(vif(lin_mod))
## DYRK1A_N ITSN1_N BDNF_N NR1_N NR2A_N pAKT_N
## 23.73661 59.47364 16.66961 96.31168 25.12792 21.88013
As expected, we see multicollinearity among most of the predictors, including a very large VIF (96!) for NR1_N.
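The values above come from car::vif, but the computation behind them is simple: VIF_j = 1 / (1 - R²_j), where R²_j is obtained by regressing predictor j on all the other predictors. A minimal base-R sketch on simulated data (the variables `x1`, `x2`, `x3` are invented; `x2` is built to be nearly collinear with `x1`):

```r
# Hand-rolled VIF: regress each predictor on the rest and invert 1 - R^2
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), response = v),
                     data = df))$r.squared
    1 / (1 - r2)
  })
}
set.seed(2)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # nearly collinear with x1
x3 <- rnorm(100)                 # independent
vifs <- vif_manual(data.frame(x1, x2, x3))
round(vifs, 1)  # x1 and x2 have large VIFs; x3 stays near 1
```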
On the residual plot and the normal quantile plot below, we also see outliers.
lin_mod <- data.frame(fortify(lin_mod))
gg_resid <- ggplot(data = lin_mod, aes(x = .fitted, y = .stdresid)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(method = "lm") +
geom_hline(yintercept = 2, color = "red") +
geom_hline(yintercept = -2, color = "red") +
ggtitle("Plot of residuals from predicted values")
gg_resid
## `geom_smooth()` using formula 'y ~ x'
qqPlot(lin_mod$.stdresid)
## [1] 37 389
I consider it better to build a linear model from only the most significant predictors.
lin_mod2 <- lm(ERBB4_N ~ PSD95_N + BAX_N + Tau_N + pPKCG_N , mouses)
summary(lin_mod2)
##
## Call:
## lm(formula = ERBB4_N ~ PSD95_N + BAX_N + Tau_N + pPKCG_N, data = mouses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.033795 -0.006833 -0.000175 0.006176 0.050748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0584628 0.0038119 15.337 < 2e-16 ***
## PSD95_N 0.0334146 0.0013381 24.971 < 2e-16 ***
## BAX_N 0.0679042 0.0183684 3.697 0.000229 ***
## Tau_N 0.0727048 0.0048728 14.921 < 2e-16 ***
## pPKCG_N -0.0024341 0.0005757 -4.228 2.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01072 on 1075 degrees of freedom
## Multiple R-squared: 0.4955, Adjusted R-squared: 0.4936
## F-statistic: 264 on 4 and 1075 DF, p-value: < 2.2e-16
Here there is no multicollinearity; however, the residual plot still looks bad and there are many outliers.
vif(lin_mod2)
## PSD95_N BAX_N Tau_N pPKCG_N
## 1.087134 1.121878 1.060838 1.040219
lin_mod2 <- data.frame(fortify(lin_mod2))
gg_resid2 <- ggplot(data = lin_mod2, aes(x = .fitted, y = .stdresid)) +
geom_point() +
geom_hline(yintercept = 0) +
geom_smooth(method = "lm") +
geom_hline(yintercept = 2, color = "red") +
geom_hline(yintercept = -2, color = "red") +
ggtitle("Plot of residuals from predicted values")
gg_resid2
## `geom_smooth()` using formula 'y ~ x'
qqPlot(lin_mod2$.stdresid)
## [1] 564 794
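Hand-picking four predictors is one option; base R’s step() offers AIC-based backward selection as an alternative. A sketch on simulated data (the variables `y`, `a`, `b` are invented; on the real data one would start from the full protein model after removing the aliased pS6_N):

```r
set.seed(3)
d <- data.frame(a = rnorm(200), b = rnorm(200))
d$y <- 0.8 * d$a + rnorm(200, sd = 0.2)  # y depends on a, not on b
full <- lm(y ~ a + b, data = d)
# Backward selection: drop terms while AIC improves
reduced <- step(full, trace = 0)
names(coef(reduced))  # the informative predictor a is retained
```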
We visualize the predicted values on an artificial dataset, where the most significant predictor PSD95_N ranges from its minimum to its maximum value and the remaining predictors are fixed at their means.
model <- lm(ERBB4_N ~ PSD95_N + BAX_N + Tau_N + pPKCG_N , data = mouses)
# Dataset for model prediction
MyData <- data.frame(
PSD95_N = seq(min(mouses$PSD95_N), max(mouses$PSD95_N), length.out = 500),
BAX_N = mean(mouses$BAX_N),
pPKCG_N = mean(mouses$pPKCG_N),
Tau_N = mean(mouses$Tau_N)
)
# Predicted values
Predictions <- predict(model, newdata = MyData, interval = 'confidence')
MyData <- data.frame(MyData, Predictions)
# Model prediction plot
Pl_predict <- ggplot(MyData, aes(x = PSD95_N, y = fit)) +
geom_ribbon(alpha = 0.2, aes(ymin = lwr, ymax = upr)) +
geom_line() +
geom_point(data= mouses, alpha = 0.2, color = 'plum4' , aes(y = ERBB4_N)) +
ggtitle("Multiple model prediction plot") +
labs(y="ERBB4_N - fitted")
Pl_predict
As a result, I built a model predicting ERBB4_N expression from the expression of only a few of the most significant proteins. Since the model does not meet its applicability conditions, I would not recommend using it.
Principal component analysis allows us to reduce the dimensionality of the data, describe the relationships between a large number of features and rank them by importance.
Our original dataset is very large, so we will use only a part of it: the subset containing complete observations. We will keep only the protein expression columns and the mouse class. Let’s also look at the class representation.
mice_set <- mouses_without_na[, -c(1, 2, 79, 80, 81,82)]
table(mice_set$class)
##
## c-CS-m c-CS-s c-SC-m c-SC-s t-CS-m t-CS-s t-SC-m t-SC-s
## 45 75 60 75 90 75 60 72
Let’s do a principal component analysis for our data.
mouse_pca <- rda(mice_set[, -77], scale = T)
First, let’s find out how much each component contributes. To do this, we turn to the summary and visualize the Proportion Explained.
pca_summary <- summary(mouse_pca)
pca_result <- as.data.frame(pca_summary$cont)
plot_data <- as.data.frame(t(pca_result[c("Proportion Explained"),]))
plot_data$component <- c(1:75)
ggplot(plot_data, aes(component , `Proportion Explained`)) + geom_bar(stat = "identity") +ggtitle(label = "Component contribution")
The first three components explain approximately 30, 17 and 10% of the variability.
Also, to see which components contribute the most, we plot the eigenvalues against a broken-stick model.
screeplot(mouse_pca, type = "lines", bstick = TRUE)
Let’s choose 4 components for analysis.
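The same screening can be done with base R’s prcomp: scale the variables, then inspect the cumulative proportion of variance per component. A sketch on a small simulated matrix (the matrix `m` is invented; on the real data the call would be `prcomp(mice_set[, -77], scale. = TRUE)`):

```r
set.seed(4)
m <- matrix(rnorm(300), ncol = 3)
m[, 2] <- m[, 1] + rnorm(100, sd = 0.3)  # induce correlated columns
p <- prcomp(m, scale. = TRUE)
# Proportion of variance explained by each component
prop <- p$sdev^2 / sum(p$sdev^2)
round(cumsum(prop), 3)  # cumulative variance; PC1 dominates here
```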
Loading plots allow us to evaluate how the variables (proteins) correlate with each other and with the component axes. Let’s look at some of these plots.
biplot(mouse_pca, display = "species", scaling = 2, choices = c(1,2))
biplot(mouse_pca, display = "species", scaling = 2, choices = c(1,4))
biplot(mouse_pca, display = "species", scaling = 2, choices = c(2,3))
biplot(mouse_pca, display = "species", scaling = 2, choices = c(2,4))
biplot(mouse_pca, display = "species", scaling = 2, choices = c(3,4))
Unfortunately, due to the large number of proteins, these plots are not very informative.
Let’s look at the ordination plots, which show the location of individual objects (mice) in component space. It is difficult to single out distinct clusters. The most pronounced pattern is on the ordination plot for components 2 and 4, where the mice group by the behavioral context of the experiment (CS or SC).
df_scores <- data.frame(mice_set,
scores(mouse_pca, display = "sites", choices = c(1:5), scaling = 1))
ggplot(df_scores, aes(x = PC4, y = PC2, colour = class)) +
geom_point() + ggtitle("Ordination in PC4 and PC2 space")
df_scores <- data.frame(mice_set,
scores(mouse_pca, display = "sites", choices = c(1:5), scaling = 1))
ggplot(df_scores, aes(x = PC1, y = PC2, colour = class)) +
geom_point() + ggtitle("Ordination in PC1 and PC2 space")
df_scores <- data.frame(mice_set,
scores(mouse_pca, display = "sites", choices = c(1:5), scaling = 1))
ggplot(df_scores, aes(x = PC1, y = PC3, colour = class)) +
geom_point() + ggtitle("Ordination in PC1 and PC3 space")
Among the packages for 3D visualization of the first three components (plot3D, rgl and plotly), I chose plotly because it is simple and intuitive and produces an interactive plot. It is difficult to make out distinct patterns on the chart; perhaps the expression values should have been averaged for each mouse, since 72 points would be easier to distinguish.
fig <- plot_ly(df_scores, x =~PC1, y = ~PC2, z = ~PC3, color = ~class, size = 0.1 )
fig